SCIENCE OF DECISION MAKING: A DATA-MODELING APPROACH


1.  INTRODUCTION 

Peptide mass fingerprinting (PMF)-based identification algorithms using mass
spectrometry (MS) data were developed in the early 1990s. One of the early PMF algorithms,
SEQUEST (Yates Lab, The Scripps Research Institute; La Jolla, CA), was widely used by the
scientific community to decipher the sequence information of peptides generated from tandem
MS analysis. This software is commercially available and solely distributed by Thermo Fisher
Scientific (Tewksbury, MA) (1,2). Later, other PMF algorithms were developed and reported in
the literature. These included Mascot from Matrix Science, Inc. (Boston, MA) and open-source
software such as the open mass spectrometry search algorithm (OMSSA) from the National
Center for Biotechnology Information (Bethesda, MD) and X!Tandem from The Global
Proteome Machine Organization, which is an online database (3-5). These algorithms assign a
peptide sequence to each experimental product ion mass spectrum, along with a score for its
match to a theoretical product ion mass spectrum that is derived from the protein sequences in
a given proteome database. The resulting peptide-spectrum match (PSM) score is computed by
descriptive, interpretative, stochastic, or probabilistic modeling methods and is used to
discriminate between true-positive (TP) and false-positive (FP) peptide identifications (6). The
aforementioned PMF algorithms have an inherently long computational time requirement that
becomes cumbersome for high-throughput proteomic analysis. Therefore, there is a need to
overcome such obstacles in the data analysis procedure through the development of relatively
rapid tools that are capable of identifying and classifying microbes in near-real-time
settings.


The  PMF  algorithms  create  overlap  between  TP  and  FP  peptide  identifications 
(7),  whereby  the  identified  FP  peptides  lower  the  overall  confidence  for  the  identified  TP 
peptides.  Keller,  et  al.  developed  the  PSM-scoring  algorithm  on  the  basis  of  machine  learning 
methods such as linear discriminant analysis (8). This algorithm provides a higher confidence
level  in  peptide  identification  because  each  spectrum  is  discriminated  by  weighing  each  vector 
feature  and  by  providing  a  relative  weight  to  that  peptide  MS  spectrum.  Many  researchers  use  a 
decoy  database  that  contains  reversed  protein  sequences  to  score  the  FP  peptide  identification 
and,  thereby,  compute  a  false-discovery  rate  (FDR)  score  (9).  This  decoy  database  has  two 
limitations:  (a)  the  database  search  time  is  doubled,  and  (b)  a  suitable  decoy  database  cannot  be 
generated  for  all  applications,  especially  when  the  researcher  is  not  doing  a  targeted  database 
search  (10). 
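
The decoy-based FDR estimate mentioned above can be illustrated with a minimal sketch. It assumes the commonly used estimate FDR ≈ (decoy PSMs) / (target PSMs) at or above a score threshold; the function name, the input layout, and the toy scores are hypothetical and are not taken from the cited work.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR as (#decoy PSMs) / (#target PSMs) at or above a score threshold.

    psms: iterable of (score, is_decoy) pairs (hypothetical input layout).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # no target hits pass the threshold, so nothing is reported
    return decoys / targets


# Toy PSM scores, made up for illustration only.
example_psms = [(4.1, False), (3.8, False), (3.5, True), (3.2, False),
                (2.9, False), (2.4, True), (2.0, False), (1.7, True)]

for cutoff in (3.0, 2.0):
    print(cutoff, decoy_fdr(example_psms, cutoff))
```

Lowering the threshold admits more target hits but also more decoy hits, which is why the estimated FDR rises as the cutoff is relaxed.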


Gupta,  et  al.  (2011)  stated  “that  target-decoy  approach  (TDA)  is  not  needed  when 
accurate  p-values  of  individual  peptide-spectrum  matches  are  available”  (11).  Moreover,  when 
using  a  decoy  database  it  is  difficult  to  maintain  the  mass  and  amino  acid  composition  of  the 
target  and  decoy  peptides. 

To overcome the issues of over-fitting a vector feature of the spectrum and the use
of a decoy database, and to dynamically lower the FDR score, we have developed a parallel data
analysis algorithm called “Merlin”. Merlin can be used to analyze PSM results from the
SEQUEST and OMSSA algorithms, and it has the potential to analyze results from other PMF
algorithms as well.

The Merlin algorithm employs multiple scores such as cross-correlation (Xcorr),
preliminary score (Sp), and mass differences coefficient (ΔCn) from SEQUEST. Merlin also
incorporates the probability value (p-value) from PeptideProphet and the expected value
(E-value) from OMSSA to compute the most probable PSM for the identification and classification
of an organism in the analyzed experimental sample. (The E-value is a parameter that describes
the number of matches expected by chance when searching a database of a particular size.)
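
The dependence of the E-value on database size can be sketched as follows. This is only an illustration that assumes the widely used relationship E = p × N, where N is the number of candidate peptides compared with a spectrum; the source does not state the exact formula used by OMSSA, and the function name and numbers are hypothetical.

```python
def e_value(p_value, n_candidates):
    """Expected number of chance matches, assuming E = p * N."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if n_candidates < 0:
        raise ValueError("n_candidates must be non-negative")
    return p_value * n_candidates


# The same p-value is less convincing when more candidates are searched.
for n in (5_000, 500_000):
    print(n, e_value(1e-6, n))
```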

In the future, we plan to incorporate other open- and closed-source PMF
algorithms so that Merlin becomes a robust and automated tool that is
capable of improving the confidence score of identified peptides during proteomics data
processing. This algorithm could be integrated within the U.S. Army Edgewood Chemical
Biological Center in-house-developed microbial identification tool, ABOid (12).


2.  METHODS 

2.1 Escherichia coli Strain O157:H7 Sample Preparation

The E. coli strain O157:H7 was grown in trypticase soy broth, contained in an
orbital shaker (125 rpm) at 37 °C, until the bacteria reached the late exponential phase
(~10⁸ cfu/mL). The cell culture was stored at 4 °C until fractionation. To isolate the
secreted protein fractions, 30 mL of culture was centrifuged at 11,300 × g for 1 h using a
Beckman  J2-MC  centrifuge  (Indianapolis,  IN).  The  supernatant  was  decanted  to  separate  it  from 
the  pellet.  This  supernatant,  which  contains  the  secreted  proteins,  is  referred  to  as  the  secreted 
fraction. The pellet was resuspended in ~3.5 mL of 100 mM ammonium bicarbonate (ABC).
This  extracellular  suspension  was  divided  into  three  aliquots  of  approximately  equal  volume. 
The  cell  pellet  extracellular  samples  were  thawed  and  lysed  by  ultrasonication  (25  s  on  and  5  s 
off  for  a  total  of  4  min)  using  a  Branson  Digital  Sonifier  (Danbury,  CT).  The  lysate  was 
centrifuged  at  14,000  rpm  for  20  min  at  10  °C  using  a  Beckman  GS-15R  centrifuge.  Samples 
were  frozen  at  -25  °C  for  up  to  4  days. 

2.2  Bacterial  Sample  Processing 

Samples  were  prepared  for  liquid  chromatography  (LC)  tandem  MS  (LC- 
MS/MS)  in  a  similar  manner  to  that  previously  reported  by  Jabbour  et  al.  (13).  Proteins  were 
extracted  from  the  secreted  fractions  by  transferring  each  sample  to  a  separate  Microcon  YM-3 
filter unit (Millipore, Billerica, MA) and centrifuging the samples at 14,100 × g for
20-30 min. The filters were each centrifuged three times at 14,000 × g for 25 min, with a 200 µL
ABC wash between centrifugations.

The proteins in the retentate were denatured at 40 °C for 1 h with 270 µL of
7.2 M urea and 30 µg/mL dithiothreitol in ABC. The urea was removed by centrifugation at
14,100 × g for 30-40 min. The retentate was washed three times using 150 µL ABC, followed by
centrifugation at 14,100 × g for 30-40 min using an Eppendorf centrifuge 5415C or 5415D
(Eppendorf North America; Westbury, NY). The filter unit was then transferred to a new
receptor tube and the proteins in the retentate were digested overnight at 37 °C with 5 µL
sequencing-grade trypsin (Product No. 511A; Promega; Madison, WI) in 10 µL acetonitrile and
240 µL ABC. The tryptic peptides were isolated by centrifuging at 14,100 × g for 20-30 min.

2.3  LC-MS/MS  Experiments 

In  a  manner  similar  to  that  previously  described  by  Jabbour  et  al.  (13),  the  tryptic 
peptides  were  separated  on  a  capillary  column  using  the  Dionex  UltiMate  3000  (Sunnyvale,  CA). 
The  resolved  peptides  were  then  sprayed  into  a  linear  ion  trap  MS  (LTQ  XL;  Thermo  Scientific; 
San  Jose,  CA).  The  product  ion  mass  spectra  were  obtained  using  the  data-dependent  acquisition 
mode  with  a  survey  scan,  followed  by  performing  an  MS/MS  evaluation  on  the  top  five  most- 
intense  precursor  ions. 
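
The precursor selection step of data-dependent acquisition described above can be sketched as follows. This is a simplified illustration that assumes a survey scan is available as a list of (m/z, intensity) pairs; the function name and the toy peak list are hypothetical, and instrument features such as dynamic exclusion are deliberately omitted.

```python
def select_precursors(survey_scan, top_n=5):
    """Return the m/z values of the top_n most intense peaks in a survey scan.

    survey_scan: list of (mz, intensity) pairs (hypothetical input layout).
    """
    ranked = sorted(survey_scan, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _intensity in ranked[:top_n]]


# Toy survey scan, made up for illustration only.
survey = [(400.2, 1200.0), (512.3, 9800.0), (623.8, 450.0), (701.4, 7600.0),
          (745.9, 3100.0), (810.5, 15000.0), (933.1, 2200.0)]

# Each selected precursor would then be isolated and fragmented (MS/MS).
print(select_precursors(survey))
```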


3.  RESULTS  AND  DISCUSSION 

3.1  Database  Search  and  Data  Analysis 


A proteome database was constructed in FASTA format from the E. coli
O157:H7 strain Sakai genome obtained from the National Center for Biotechnology Information
(NCBI) genomic database repository (http://www.ncbi.nlm.nih.gov, accessed August 14, 2012).
The constructed proteome database included 115 protein sequences from potential laboratory
contaminants, and it consisted of 5433 proteins that were used
in this study. The targeted proteins in the proteome database were digested in silico with trypsin
as the cleavage enzyme to obtain the theoretical product ion spectra of all potential
peptides. Then, the proteome database was indexed in FASTA format for compatibility with the
examined algorithms (SEQUEST and OMSSA), using the parameters listed in Table 1.

The experimental product ion spectra in *.RAW file format were obtained using
the LTQ XL MS and converted into the mass-to-charge extensible markup language (mzXML)
format using a file-conversion tool developed by the Seattle Proteome Center at the Institute for
Systems Biology (Seattle, WA) (14). Three replicate suspension samples were analyzed by tandem
MS, and the resulting 43501 MS/MS spectra were searched against the constructed proteome
database according to the parameters listed in Table 1 and two additional parameters:
(a) mass tolerance of 2.50000 amu and (b) fragment ion tolerance of 1.00000 amu.

Table 1. Protein Sequence In Silico Digestion Parameters

Parameter               Value
FASTA database          EC_Sakai.fasta
FASTA index             EC_Sakai.fasta.idx
FASTA digest            EC_Sakai.fasta.dgt
Enzyme name             Trypsin (KR)
Mass range              600-3500 m/z
Sequence length         5-35
Mass type               Monoisotopic
Missed cleavage sites   2
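
The in silico digestion controlled by the parameters in Table 1 can be sketched as follows. This is a minimal illustration, not the SEQUEST or OMSSA implementation: it assumes cleavage after every K or R with no proline exception (an assumption based on the "Trypsin (KR)" entry), allows up to two missed cleavages, and keeps peptides of 5-35 residues. The mass-range filter is omitted, and the function name and test sequence are hypothetical.

```python
import re


def tryptic_digest(protein, missed_cleavages=2, min_len=5, max_len=35):
    """Return the set of tryptic peptides of a protein sequence.

    Cleaves after every K or R (assumption: no proline rule) and joins up to
    `missed_cleavages` extra adjacent fragments to model missed cleavage sites.
    """
    # Each fragment ends at K/R, except possibly the C-terminal fragment.
    fragments = re.findall(r"[^KR]*[KR]|[^KR]+", protein)
    peptides = set()
    for start in range(len(fragments)):
        last = min(start + missed_cleavages, len(fragments) - 1)
        for end in range(start, last + 1):
            peptide = "".join(fragments[start:end + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


# Toy sequence, made up for illustration only.
toy_protein = "MKTAYIAKQRQISFVKSHFSR"
print(sorted(tryptic_digest(toy_protein, missed_cleavages=0)))
print(len(tryptic_digest(toy_protein)))
```

Allowing missed cleavages enlarges the search space considerably, which is one reason the database search time grows with this parameter.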

The MS files (.RAW files) were submitted to SEQUEST. The same files, after
conversion into mzXML files, were submitted to OMSSA for database searching. The output
from the SEQUEST search, without any threshold cutoff, was submitted to the ABOid
software to derive the probability score of the peptide product ion spectra and then converted
into a comma-separated values file format (.csv) to concatenate the spectral files into one file. The PSM
results from the OMSSA search were also exported to a .csv file format. The .csv files
generated from SEQUEST contain information such as the scan number, peptide, Xcorr, Sp, ΔCn,
RSp (rank score), M+H (molecular ion), protein name, and accession number. The .csv files
generated by OMSSA contain information such as the scan number, peptide sequence,
peptide mass, protein name and mass, accession number, E-value, and p-value.

For each analyzed sample, the .csv files resulting from the OMSSA and
SEQUEST algorithms were also submitted through the Merlin algorithm to extract the common
proteins that both algorithms had identified. The common proteins, their weighting factors, and database
parameters were submitted again to the ABOid software for identification and computation of
the probability score. Peptide sequences with probability scores of 95% and higher were retained
and used to generate a binary matrix of sequence-to-bacterium assignments. The binary matrix
was populated by matching the peptides with corresponding proteins in the constructed proteome
database and assigning a score of one for a match and zero for a mismatch. The columns in the
binary matrix represent the proteomes of the bacteria and contaminants in the database, and the rows
represent the identified tryptic peptide sequences that were obtained from tandem MS spectral
processing. A sample microorganism is matched with a database bacterium by the number of
unique peptides that remain after the degenerate peptides are filtered from the binary matrix.
Verification of the classification and identification of candidate microorganisms is performed
through hierarchical clustering analysis and taxonomic classification as shown by the ABOid
software. The flowchart for the Merlin process is shown in Figure 1.
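
The binary matrix and the unique-peptide count described above can be sketched as follows. This is a simplified illustration, not the ABOid implementation: it assumes a peptide matches a proteome when it occurs as a substring of any protein in that proteome, and it treats a peptide as unique when exactly one column is set. All names and sequences are hypothetical.

```python
def build_binary_matrix(peptides, proteomes):
    """Rows are peptides, columns are proteomes; 1 = match, 0 = mismatch.

    proteomes: dict mapping a proteome name to a list of protein sequences.
    """
    names = sorted(proteomes)
    matrix = [
        [1 if any(pep in prot for prot in proteomes[name]) else 0 for name in names]
        for pep in peptides
    ]
    return names, matrix


def unique_peptide_counts(names, matrix):
    """Count peptides that match exactly one proteome (degenerate rows are skipped)."""
    counts = {name: 0 for name in names}
    for row in matrix:
        if sum(row) == 1:
            counts[names[row.index(1)]] += 1
    return counts


# Toy proteomes and peptides, made up for illustration only.
toy_proteomes = {
    "organism_A": ["MKTAYIAKQRQISFVK"],
    "organism_B": ["MKTAYIAKLLNDER"],
    "contaminant": ["GGSHFSRAA"],
}
toy_peptides = ["TAYIAK", "QISFVK", "QRQISFVK", "LLNDER", "SHFSR"]

names, matrix = build_binary_matrix(toy_peptides, toy_proteomes)
counts = unique_peptide_counts(names, matrix)
print(names)
print(matrix)
print(max(counts, key=counts.get), counts)
```

In this toy case the peptide TAYIAK occurs in two proteomes, so it is degenerate and does not contribute to either count.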

Figure 1. Flow chart for the Merlin algorithm.


Table 2 shows the total number of unique proteins observed from the two
database-searching algorithms and the common proteins identified by the Merlin algorithm.
Although the number of common proteins identified using Merlin was lower than the number
identified by either of the other algorithms, the identification score for the bacteria was higher
using the protein list from Merlin than using the list from either of the other algorithms.
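
The extraction of proteins common to both search outputs can be sketched as a set intersection on accession numbers. This is only an illustration of that single step; the weighting of scores within Merlin is not specified here and is not modeled. The column name "accession" and the toy .csv contents are hypothetical.

```python
import csv
import io


def accessions(csv_text, column="accession"):
    """Collect the set of protein accession numbers from .csv text."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


# Toy exports, made up for illustration only.
sequest_csv = (
    "scan,peptide,xcorr,accession\n"
    "101,TAYIAK,2.9,P001\n"
    "102,QISFVK,3.4,P002\n"
    "103,SHFSR,1.8,P003\n"
)
omssa_csv = (
    "scan,peptide,e_value,accession\n"
    "101,TAYIAK,0.001,P001\n"
    "102,QISFVK,0.0004,P002\n"
    "104,LLNDER,0.02,P004\n"
)

# Proteins reported by both algorithms.
common = accessions(sequest_csv) & accessions(omssa_csv)
print(sorted(common))
```

Because an intersection can never be larger than either input set, the Merlin column in Table 2 is necessarily no larger than the SEQUEST and OMSSA columns.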


Table 2. Database-Searching Comparison and Number of Unique Proteins Observed

Sample ID       Spectra   SEQUEST   OMSSA   Merlin
2011-02-10-02   13265     103       107     80
2011-02-10-03   14728     82        92      68
2011-02-10-05   15508     119       140     101

Figure 2 shows a Venn diagram of the analyzed bacterial samples and the number
of candidate proteins identified using the Merlin algorithm. The results showed that more
proteins were common across the replicate analyses with Merlin than with either of the other
algorithms used individually.


Figure  2.  Venn  diagram  for  Merlin  results  obtained  from  the  replicate  bacterial  samples. 


4.  CONCLUSIONS 

This  study  showed  that  the  use  of  a  single  PMF  algorithm  could  result  in  a  higher 
FDR  value  when  compared  with  a  combinatorial  approach  that  concurrently  retains  spectral 
information  from  diverse  individual  algorithms.  This  conclusion  was  based  on  statistical 
confidence  using  Bayesian  and  Gaussian  PMF  algorithms  to  lower  the  FP  rate  and  eliminate  the 
data  analysis  bottleneck. 

Additional studies are needed to incorporate de novo analysis and identify the
best-fitting peptides without using database-searching tools. The Poisson distribution should be
used to match de novo output with peptides identified by database-searching tools. In addition,
the incorporation of a receiver-operating characteristics curve will enable computation of the
probability cutoff value for analysis. Such studies will expand the algorithms to provide
enhanced selectivity.
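
One way a receiver-operating characteristics analysis could yield a probability cutoff is sketched below. This is an assumption-laden illustration of the proposed future work, not a method used in this study: it selects the cutoff that maximizes Youden's J statistic (true-positive rate minus false-positive rate), which is one of several possible criteria. The function name, scores, and labels are hypothetical.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J = TPR - FPR.

    scores: probability scores; labels: 1 for a true identification, 0 otherwise.
    A peptide is called positive when its score is >= cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best = (None, -1.0)
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        j = tp / positives - fp / negatives
        if j > best[1]:  # ties keep the lowest cutoff
            best = (cutoff, j)
    return best


# Toy scores and ground-truth labels, made up for illustration only.
toy_scores = [0.99, 0.97, 0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(best_cutoff(toy_scores, toy_labels))
```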

ACRONYMS AND ABBREVIATIONS

ΔCn         mass differences coefficient
ABC         ammonium bicarbonate
E-value     expected value
FDR         false-discovery rate
FP          false-positive (peptide identification)
LC          liquid chromatography
LC-MS/MS    liquid chromatography-tandem mass spectrometry
MS          mass spectrometry
mzXML       mass-to-charge extensible markup language
OMSSA       open mass spectrometry search algorithm
PMF         peptide mass fingerprinting
PSM         peptide-spectrum match
p-value     probability value
RSp         rank score
Sp          preliminary score
TP          true-positive (peptide identification)
Xcorr       cross-correlation